Gene-Based Association Mapping for Dental Caries in The GENEVA Consortium

Objective: Dental caries is a multifactorial disease with high prevalence in both children and adults. Recent genome-wide association studies (GWASs) have revealed that genetic factors play an important role in caries incidence. However, existing methods are not sufficient to identify caries-associated genes, due to the complex correlation structure of caries GWAS data, and lack of appropriate summarization at the gene level. This paper attempts to address that by analyzing data from the Gene, Environment Association Studies (GENEVA) consortium. Methods: We investigated gene-based genetic associations for dental caries based on genome-wide data derived from the GENEVA database, with adjustment to covariates, linkage disequilibrium among single-nucleotide polymorphisms, and family relations, in sampled individuals. Results: Several suggestive genes were identified, in which some of them have been previously found to have potential biological functions on cariogenesis. Conclusions: By comparing the gene sets identified from gene-based and SNP-based association testing methods, we found a non-negligible overlap, which indicates that our gene-based analysis can provide substantial supplement to the traditional GWAS analysis.


Introduction
Dental caries is a chronic, transmissible disease caused by activities of bacteria [1]. Among the common disorders in humans, caries has the highest prevalence, affecting 2.43 billion people worldwide in their permanent teeth and 620 million children in their primary teeth [2]. It is known that multiple factors contribute to a person's risk for caries development and progression. Besides environmental factors (diet, oral hygiene, etc) and host factors related to subjects' oral conditions, genetic factors have been shown to play an important role in caries etiology [3][4][5].
Analyzing genome-wide data generated from epidemiological studies of dental health opens up cost-effective opportunities to uncover how genetic factors affect inherited risks for dental caries. Previous genome-wide association studies (GWASs) for dental caries have identified several genes that may be associated with caries heritability [6], including enamel formation genes such as AMBN, AMELX, ENAM [7][8][9], taste receptor genes such as TAS2R38, TAS1R2 [10,11], and genes related to immunity such as HLA [12,13] and saliva such as PRH1 [14]. Although considerable efforts have been made to capture GWAS signals for dental caries, they are often limited to marginal association between caries traits and each individual single nucleotide polymorphism (SNP), hence may have limited power at the gene level (especially when evaluating the genetic effect of low-frequency minor alleles). Moreover, due to genotyping differences in different GWASs, findings from such analyses may exhibit inconsistencies, which pose difficulties for biological interpretations. On the other hand, SNP-set-based or gene-based association analyses [15][16][17][18][19] are receiving increasing attention since genes are the functional unit of the human genome and remain highly consistent across diverse human populations. However, most of these gene-based methods are not sufficient to account for linkage disequilibrium (LD) among SNPs, i.e., the non-random association of alleles at different loci. In fact, since caries GWASs are often conducted on related individuals, e.g., family trios or pedigree samples, it is critical to model the complex genotypic correlations caused by both LD and familial relation in order to improve the power of association testing.
In this paper, we propose to use a novel gene-based association mapping strategy to identify caries-associated genes based on genome-wide data from the GENE enVironment Association studies (GENEVA) consortium. Several suggestive genes were identified, in which some (such as PTPRD) have been previously found in other single-SNP-based GWASs to play plausible biological roles for dental caries. An interesting finding was obtained by comparing the gene sets identified separately from gene-based and SNP-based tests. The non-negligible overlap between these two sets suggests that the two types of analyses are not independent of each other and therefore results from gene-based association analyses may contain important signals relevant to cariogenesis, complementing those from the traditional single-SNP-based GWASs.

Materials and Methods
Dental caries data from the GENEVA program Initiated in 2006, the GENEVA program aims to enhance the benefit of collaborative work to further maximize knowledge obtainable through GWAS [20]. One participating study (dbGaP accession: phs000095.v3.p1) of this program collected dental caries data for 5,291 phenotyped and 4,020 genotyped individuals (including cohort of families and unrelated individuals). The major phenotype variable "Tot_D1MFT" describes the status of subjects' teeth, which was derived by counting all teeth with caries (including white spots). Specifically, a tooth is recorded as a D if a carious lesion(s) or both carious lesion(s) and a restoration are present; M if a tooth has been extracted due to caries; and F if a permanent or temporary filling is present. In addition to the tooth-level caries index, another index at tooth surface level, "Tot_D1MFS", is also generated in a similar fashion. Eight sociodemographic and environmental variables are collected as covariates, including sex, age at the time of dental exam, source of water (city/public, well and other), tooth brushing frequency, saliva flow rate (calculated as the volume of saliva divided by 3 minutes: in unit ml/min); home water source fluoride level (obtain from home tap water: in unit ppm), educational attainment (up to high school, some college, four-year degree or beyond), presence of oral S. mutans present (yes, no). The genotype data were collected for a total of 589,735 SNPs on an Illumina BeadChip, among which 289,629 (49.11%) are located in 26,449 entrez genes (according to the human gene Ensembl dataset GRCh37). Based on the cleaned genotype data, a total of 20,197 (76.36%) entrez genes were found to contain more than one SNP in the genotype data As part of the quality-control procedure, the genotyped individuals were further filtered by excluding those who met either of the following two criteria:

2.
Completeness ≤ 96%, where completeness is defined as the proportion of SNPs for which a given individual had genotypes called. Under these criteria, a subset of 652 samples from the GENEVA dental caries data, which have complete data in both phenotypes and genotypes, were selected for the analysis.
Among these 652 participants, 571 are from 201 families (nuclear families or parentoffspring trios), and the rest are unrelated individuals. Quality control on SNPs was conducted based on the following conditions:
To perform the gene-based test, the entire genotype matrix (rows corresponding to individuals and columns corresponding to SNPs) was split according to the starting and ending positions of each gene.
Each partition of the genotype matrix was further cleaned in order to avoid the possible non-singular problem regarding the following aspects: (1) columns corresponding non-polymorphic SNPs were excluded, and (2) given limited sample size, only columns with unique SNP data were retained whereas duplicated ones were eliminated. These quality assurance steps resulted in 26,831 genes in consideration.

Statistical approach
The adaptive-weight burden test (ABT) is a retrospective, mixed model test for genetic association of quantitative traits on genotype data with complex correlations. In the GENEVA dental caries study, a group of individuals with known pedigree information were sampled for phenotype (the DMFT index), covariates (sex, age, etc), and genotypes (multiple SNPs in pre-defined gene regions). Under the gene-based testing setup, we denote the phenotype vector by Y, covariate matrix by Z, and genotype matrix by G (for each gene, size n × m . Then, the ABT analysis is conducted on a retrospective paradigm G|(Y,Z), i.e., by treating genotypes as random response in regression. The main advantage of such a retrospective model lies in its ability to directly model genotypic correlations across both SNPs (caused by LD) and samples (due to familial relation) [21], as well as to maintain robustness to phenotype model misspecification [22].
It has been shown that the ABT statistic Φ is the kinship matrix of the sampled individuals, and, σ e 2 , σ a 2 stand for variance due to random measurement error and variance attributed to additive polygenic random effects, respectively. The covariance matrix of multiple SNPs in a pre-defined gene region is denoted as DRD, where R is the LD correlation matrix and D = diag σ j , 1 ≤ j ≤ m is a diagonal matrix of the standard deviations of the SNPs in genotype G. ABT adopts adaptive weights to collapse multiple SNPs in each pre-defined gene region to improve power of association testing. Its alternative view is a kernel test with the generalized Madsen- . The null distribution of S ABT can be explicitly derived as a mixture of χ 1 2 random variables each obtained from the eigen value of the matrix, WG T PGW, where In order to justify the results from the gene-based association analysis, we compared the gene set identified by gene-based test with that by SNP-based test. For consistency reasons, we chose to use MASTOR as the SNP-based test. MASTOR is a mixed-model, retrospective score test for genetic association with quantitative traits in samples with related individuals [22]. When using such a single-SNP test, we consider a gene to be significantly associated with the DMFT trait if there exists at least one SNP in that gene region with p-value less than the Bonferroni corrected nominal α=0.05/(#SNP in the gene), after adjusting for non-genetic covariates. Since all other settings for this study, including the samples, covariates, phenotypes/genotypes, and the retrospective analytical strategy, keep unchanged from the gene-based analysis, the gene sets identified by ABT and by MASTOR should be comparable. A simple chi-square test for independence can then be conducted with a 2×2 contingency table formed with the significant/insignificant number of genes identified by these two methods.

Results
The GENEVA dental caries data (dbGaP accession: phs000095. v3.p1) include 5,291 phenotyped and 4,020 genotyped individuals. In this study, we focus on a subset of 652 participants who have complete data in the following phenotypic characteristics: gender, age, education group, water source, presence/absence of S. mutans, home tap water fluoride level, saliva flow, brush frequency, and the Decay-Missing-Filled (DMF) index. These characteristics are summarized in Table 1.
We note that, the 652 participants include both children and adults. The age ranges from 7 to 61 years, and 406 out of the total 652 are older than 18 at the time of examination. The majority of these participants are white (648 white, two multi or bi-racial, and two with missing values), therefore race was not included as a covariate in our study.
We use the adaptive-weight burden test [15] to perform gene-based association testing for the DMFT index in the 652 participants, adjusting for the above eight non-genetic covariates and properly addressing various types of genotypic correlations caused by both LD and familial relation. This test first obtains the transformed phenotypic residual from the phenotype model under the null hypothesis, and then collapses the genotypes of multiple SNPs in each gene region by using data-adaptive weights to achieve a powerful retrospective, gene-based test (details are provided in Statistical approach).
In the step of null phenotype model identification, two variance-component parameters and 10 fixed-effect regression coefficients for the non-genetic covariates (two indicators for water source categories) are estimated by the maximum likelihood method. These estimates are shown in Table 2. In covariance estimation, we notice that the estimated value of σ a 2 / σ 2 , i.e., the narrow-sense heritability, is about 0.44, which is comparable with traditional heritability estimates of DMFT (or DMFS) in the permanent dentition of other family-based dental caries GWASs [23][24][25]. In covariates effects estimation, we see that at nominal level 0.05, three covariates sex, age, and S. mutans are significantly associated with the DMFT trait with positive coefficient estimates, indicating that patients with male gender, younger age, and absence of S. mutans are more likely to have lower DMFT measurements. The brush frequency, while expected to be positively associated with dental caries, seems to have a "boundary" effect, which turns out to be significant at nominal level 0.1 but not at 0.05.
respectively. In Figure 2, the resulting genomic inflation factor λ is reported as 1.05 which is generally considered benign [26], suggesting that the gene-based p-values did not show substantial departure from the uniform distribution. Therefore, the extent of inflation due to population stratification or other confounders is negligible. Table 3 lists the 10 top genes according to ranked ABT p-values for the GENEVA dental caries data. Among these 10 genes, PTPRD (MIM: 601598) has been recently reported to be associated with smooth and pit-and-fissure surface caries in the primary dentition in children by a single-SNP based GWAS (reported SNP: rs10958998, intronic, [27]). The PTPRD gene encodes a member of the PTP (protein tyrosine phosphatase) family which is known to be signaling molecules that regulate a variety of cellular processes including cell growth, differentiation, mitotic cycle, and oncogenic transformation.
Our study also identified five other genes that were reported in the GWAS catalogue [28] to be relevant to cariogenesis, namely FHIT ( The comparison of gene-based test and SNP-based test shows that, out of the total 26,831 genes, 1,053 exhibit significance by MASTOR, slightly less than the 1,468 genes found by ABT, and the intersection set, i.e., identified by both MASTOR and ABT, contains 688 genes. The contingency table is shown in Table 4, which gives a very significant p-value (< 2.2 × 10 −16 ), indicating that the results from these two types of tests are not independent.

Discussion
Traditional GWASs rely on screening the genome on the basis of SNPs. Though such a SNP-based association testing strategy has been shown successful in identifying susceptibility loci for several complex genetic diseases [31,32], challenges in GWASs still exist: First, SNP-based testing detects only marginal effects and may be underpowered to evaluate rare-variant effects due to their low allele frequencies. Second, since SNP-based testing only reports significant SNPs, identification of genes is usually ad hoc, depending on the relative location (intron, intergenic, noncoding, up/downstream, regulatory region, etc) of the identified SNPs to target genes, and interpretation of genetic effects at the gene level remains elusive. In contrast, gene-based testing assesses joint effects of multiple variants in a predefined gene region. Compared with SNP-based testing, gene-based testing has several appealing features. First, shifting the testing unit from SNP to gene generates more interpretable and replicable findings in gene function [33] and gene-gene interaction [34]. Second, by aggregating small signals from each SNP variant, gene-based testing usually achieves improved power, especially for low-frequency minor alleles [35]. Third, chance findings due to multiple testing will be reduced by using gene-wide instead of genome-wide significance level [10]. Finally, gene-based testing also lends itself to meta-analysis of combined data from multiple studies [36]. For these reasons, gene-based association testing is believed to be a natural approach for association analysis in the post-GWAS era of dense genotyping and fine mapping [37].
In this study, we performed gene-based association testing for dental caries data from the GENEVA consortium, with adjustment of phenotype covariates and accounting for LD in SNPs and familial relation in samples. We observed suggestive associations between the DMFT trait and several genes, some of which have been found to have plausible biological functions relevant to cariogenesis. Three non-genetic covariates sex, age, and S. mutans were found significantly associated with the DMFT trait, and the narrow-sense heritability was found comparable with traditional heritability estimates in previous family-based dental caries GWASs. Due to the differences between study designs and testing methods, it is not reasonable to compare the gene set identified by this study with that by other SNP-based GWASs. However, we attempted to compare the results from both gene-and SNP-based association testing on the basis of the GENEVA dental caries data. The comparison revealed that the gene-based test captured 65.34% genes that are significant in the SNP-based test. A further test for independence illustrated that, though gene-based association testing has a totally different mechanism, the identified genes are significantly overlapped with those by SNP-based association testing, given that two tests are performed on the same GENEVA data. Therefore, findings from this gene-based association testing may contain important signals relevant to cariogenesis and could complement those from the traditional SNP-based GWASs.

Acknowledgement and Funding
This genome-wide association study is part of the Gene Environment Association Studies (GENEVA) program of the trans-NIH Genes, Environment and Health Initiative (GEI). Genotyping services were provided by the Center for Inherited Disease Research (CIDR). Assistance with phenotype harmonization and genotype cleaning, as well as with general study coordination, was provided by the GENEVA Coordinating Center and by the National Center for Biotechnology Information (NCBI). Data and samples were provided by:    Note: * σ 2 = σ a 2 + σ e 2 † : Brush frequency coded as: 1= more than 3 times per day, 2 = 3 times per day, 3 = 2 times per day, 4 = once a day, 5 = less than once a day. In the final samples, category 1 has zero observations, so category 2 is treated as the reference group. § : Education level is coded as: 1 = up to high school, 2 = some college, 3 = four-year degree or beyond.